Deep fragment embeddings for bidirectional image sentence mapping

Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (ICML 2015)

  • Feature map after the front-end CNN: 14×14×512 (width × height × channels)
    • each 1×1×512 slice is an annotation vector $\mathbf{a}_i$
  • Implementation details (from inspecting the released source code)
    • $f_{att}$: 2-layer MLP
      • implementation of $f_{att}(\mathbf{a}_i, \mathbf{h}_{t-1})$
        • intermediate layer: $\mathbf{ctx} = \mathbf{W}_{D \to D}\,\mathbf{a}_i + \mathbf{b}$
        • final layer: $e_{ti} = \mathbf{W}_{D \to 1}(\mathbf{ctx} + \mathbf{W}_{n \to D}\,\mathbf{h}_{t-1}) + b$
    • $f_{init,c}$: 2-layer MLP producing the initial LSTM memory $\mathbf{c}_0$ from the mean annotation vector $\frac{1}{L}\sum_i \mathbf{a}_i$
    • $f_{init,h}$: 2-layer MLP producing the initial LSTM hidden state $\mathbf{h}_0$, likewise from the mean annotation vector
  • Source Code
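The attention and initialization details above can be sketched in NumPy. All weight names, the hidden size $n$, and the random values are illustrative assumptions, not the paper's trained parameters; a tanh is inserted between the two layers of $f_{att}$ (an assumption, since without a nonlinearity a "2-layer MLP" would collapse into a single linear map).

```python
import numpy as np

# Dimensions from the notes: a 14x14x512 feature map gives L_loc = 196
# annotation vectors of size D = 512; n is the LSTM hidden size (assumed).
L_loc, D, n = 14 * 14, 512, 1024

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the learned parameters in the formulas above.
W_DD = rng.standard_normal((D, D)) * 0.01   # W_{D->D}
b_DD = np.zeros(D)
W_nD = rng.standard_normal((D, n)) * 0.01   # W_{n->D}
W_D1 = rng.standard_normal((1, D)) * 0.01   # W_{D->1}
b_D1 = np.zeros(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def f_att(a, h_prev):
    """Score each annotation vector a_i against the previous LSTM state.

    a: (L_loc, D) annotation vectors, h_prev: (n,) previous hidden state.
    Returns attention weights alpha_t over the L_loc spatial locations.
    """
    ctx = a @ W_DD.T + b_DD                            # intermediate layer, (L_loc, D)
    e = np.tanh(ctx + W_nD @ h_prev) @ W_D1.T + b_D1   # final layer, (L_loc, 1)
    return softmax(e.ravel())

def f_init(a, W1, b1, W2, b2):
    """Initialize an LSTM state from the mean annotation vector (2-layer MLP)."""
    m = a.mean(axis=0)
    return np.tanh(W2 @ np.tanh(W1 @ m + b1) + b2)

a = rng.standard_normal((L_loc, D))
h_prev = rng.standard_normal(n)
alpha = f_att(a, h_prev)   # (196,) attention weights, sums to 1
z = alpha @ a              # expected (soft-attention) context vector, (512,)
```

This is the soft-attention path: `alpha` weights the 196 locations and `z` is their expectation, which feeds the LSTM at the next step.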

Deep Visual-Semantic Alignments for Generating Image Descriptions

DenseCap: Fully Convolutional Localization Networks for Dense Captioning